Design optimization of groundwater circulation well based on numerical simulation and machine learning

The optimal design of groundwater circulation wells (GCWs) is challenging. The key to purifying groundwater using this technique is its proficiency and productivity. However, traditional numerical simulation methods are limited by long modeling times, random optimization schemes, and optimization results that are not comprehensive. To address these issues, this study introduced an innovative approach for the optimal design of a GCW using machine learning methods. The FloPy package was used to create and implement the MODFLOW and MODPATH models. Subsequently, the formulated models were employed to calculate the characteristic indicators of the effectiveness of the GCW operation, including the radius of influence (R) and the ratio of particle recovery (Pr). A detailed collection of 3000 datasets, including measures of operational efficiency and key elements in machine learning, was meticulously compiled into documents through model execution. The optimization models were trained and evaluated using multiple linear regression (MLR), artificial neural networks (ANN), and support vector machines (SVM). The models produced by the three approaches exhibited notable correlations between anticipated outcomes and datasets. For the optimal design of circulating well parameters, machine learning methods not only improve the optimization speed, but also expand the scope of parameter optimization. Consequently, these models were applied to optimize the configuration of the GCW at a site in Xi’an. The optimal scheme for R (Q = 293.17 m3/d, a = 6.09 m, L = 7.28 m) and optimal scheme for Pr (Q = 300 m3/d, a = 3.64 m, L = 1 m) were obtained. The combination of numerical simulations and machine learning is an effective tool for optimizing and predicting the GCW remediation effect.


Materials and methods
The method of GCW optimization can generally be divided into three main steps (Fig. 1).The first step is numerical simulation, in which datasets for machine learning are obtained.The second step is machine learning, in which the acquired datasets are used to train the models for optimizing the parameters of GCW.In this research, Multiple Linear Regression (MLR), Artificial Neural Networks (ANN), and Support Vector Machine Table 1.Features of the three tested machine learning models.

Model
Author(s) Advantages Disadvantages

MLR Korkmaz, M
Easily establishes linear causality among various variable groups, simplifying analysis Overlooks the interaction effect and nonlinear causality ANN Gupta, A. K. and Guntuku, S. C Powerful nonlinear mapping capability, allowing the approximation of any nonlinear continuous function ANN models with missing physical models may be significantly incorrect SVM Esteki, S. and Naghsh-Nilchi, A. R Establishes linear causal relationships between multiple sets of variables; can be introduced through a kernel function in the case of nonlinearity to map it to linear Low-dimensional calculation (SVM) were applied to train and appraise the optimization models.In the third step, the parameters of the GCW are optimized for the test site.

Confirmation of characterization indicators and influence factors
1. Indicators of circulation efficiency: Typically, the flow field induced by GCW exhibits the traits depicted in Fig. 2. It can be segmented into three parts, the upstream capture zone, the circulation zone, and the downstream release zone 7.In order to characterize of groundwater features surround the GCW, two indicators are usually applied 8,13,16 : the radius of influence (R) and the particle recovery ratio (P r ).

① Radius of influence (R):
The variable R plays a crucial role in defining the range of influence within the circulation zone.This represented the greatest horizontal separation from the circulation zone's edge to the axis of the well.The hydraulic gradient alters the form of the circulation area, leading to fluctuating R values in different orientations.Therefore, there is a variance in the radius both along and at right angles to the groundwater flow.The indicator R, identified as the upstream radius parallel to the hydraulic gradient, is ascertainable through the computation of the distance between the particle migration trajectory in the particle tracking model.② Particle recovery (P r ): The variable P r serves as a measure for the groundwater's capacity to be captured by the extraction screen.Groundwater from the injection screen moves towards two areas: the extraction screens and the downstream release zone.To quantify the indicator via numerical simulation, the results of particle tracking were calculated by MODPATH 12,17 .The value of P r can be expressed as the proportion of particles in the circulation zone to the total number (N cycle /N total ).The schematic diagram of the indicators is shown in Fig. 3.

Influence factors:
The success of GCW operations largely stems forms the hydrogeological conditions of the remediation site.In addition, the configuration and operation mode of GCW are also crucial 8,28 .This study focused on identifying key factors influencing of GCW's operations and their corresponding indicators of characterization.The key elements are listed below.
① Hydrogeological parameters: Typically, hydrogeological parameters are employed to assess the appropriateness of the GCW techniques.This research primarily focused on defining hydrogeological parameters such as the horizontal conductivity (K H ), vertical hydraulic conductivity (K V ), the vertical heterogeneity (K H /K V ), aquifer thickness (M), porosity (n), specific yield (μ), and hydraulic gradient (I).② Structure and operation parameters: The distance between the top of the upper screen section and the bottom of the lower screen section (L) and the total length of the two screen sections (a) were used as the indicators of the circulation well structure parameters; the pumping rate (Q) was used as the main indicator of operating parameters 12,16 .

Development of database for machine learning
The machine learning database was developed utilizing the FloPy package in conjunction with Python.A variety of tools exist for developing models, encompassing Python-based packages for plotting, manipulating arrays, optimizing, and analyzing data.In particular, FloPy was chosen for creating GCW numerical models due to its adaptability in handling MODFLOW and MODPATH packages via coding 29 .By entering varied parameter values, we derived diverse characterization indicators for GCW, leading to the creation of a database.Ultimately, the FloPy program efficiently produced over 3000 samples suitable for machine learning applications.An in-depth account of how the database was developed is provided in the supplementary material.www.nature.com/scientificreports/

Machine learning approaches
According to the views of formal researchers and the preliminary work of this study 30,31 , the key indicators R and P r were designed as the output variables.The influence variables M, Q, I, μ, n, a, and L were set as the input variables.A reliable and effective database is important to the performance and the conclusion of machine learning.So, data cleaning before training plays a crucial role in the realm of machine learning.Box plots were used to eliminate abnormal values, and NaN (not a number), rather than numerical ones, were removed.Following the cleaning of data, the database was divided randomly into two groups; after several trials, the model results were found to be optimal for 80% and 20% of the data assigned to the training and test sets, respectively.Utilizing the training dataset, the model was trained to derive the functions linking the input and output variables.The test dataset was applied to assess the model's forecasting capabilities.

Database preprocessing
The application of computer technology to imitate human learning activities is a relatively new field of research 18,32 .A variety of analytical techniques are utilized within machine learning algorithms to construct related models.Each of these models is employed to deduce new tasks form the data.Generally, machine learning algorithms can be categorized into two types based on their modeling methods: supervised and unsupervised learning.Supervised learning involves training a model to elucidate the link between feature variables and their results.Conversely, in the realm into the unexplored configurations of a specific given dataset 33 .The aim of this research was to explore how characterization indicators correlate with influence factors.It is a typical regression problem.Consequently, the method involving partially supervised learning method was applied.MLR, ANN, and SVM serve as effective techniques in addressing regression issues.

Model training
Python, known for its readability, interactivity, and cross-platform nature, excels in code development efficiency.Scikit-learn is a package of Python that integrates a variety of advanced machine learning algorithms and can be used to solve medium-scale supervised and unsupervised problems 34,35 .In this research, three distinct algorithms (MLR, ANN, SVM) for model training within the Scikit-learn package were adopted.The theory of the algorithms are as follows.

Multiple linear regression
Multiple linear regression (MLR) is the most common method for determining the linear relationship between input and output variables when handling features with limited data.The MLR method was applied to find a linear correlation between input and output variables.The mathematical form is as follows 25 : where ŷ is the regressor; b i (i = 0, 1, 2, …, n) is the coefficient of each variable, which represents the weight of the variable;x i (i = 1, 2, …, n) is the input variable of the regression; c is the intercept term.Through continuous training, the values of b i and c are confirmed according to the minimization of the fitting error between the forecasted and actual value.The key of MLR method is the least squares method which is widely employed to estimate the parameters by fitting a function to a set of measured data.This approach seeks to identify the best outcome when the sum of squares error (SSE) is minimized.SSE can be defined as follows: where SSE values approaching zero indicate the closeness of estimated parameters to the actual value.If f (x i , β i ) is linear, then it is a linear least square.The least squares model can be solved by employing simple calculus.However, if f (x i , β i ) is nonlinear, it can be solved by an iterative numerical approach 30.

Artificial neural networks
Artificial neural networks (ANN) can execute learning and prediction functions through the emulation of human learning processes.ANN is capable of identifying links between input and output data while forming and fortifying neurons connections 26,36 .An algorithm based on a multi-layer perception neural network could rank as the top choice among artificial neural network algorithms.Due to its capability to tackle complex regression problems, this research opted to develop artificial neural networks featuring input, output, and hidden layers to enhance GCW optimization (Fig. 4).
Neurons linked in unison convert the input data into output values.In the input layer, seven neurons, symbolizing seven input variables, were established.A single neuron, symbolizing an individual goal for each predictive issue, was set in the output layer.Numerous experiments were conducted to ascertain the optimal hyper-parameters.To forecast R and Pr, four hidden layers were set, starting with 256 neurons, followed by 128 neurons in the next layer, 64 neurons in the third, and 32 neurons in the fourth.The design of the model can be described as follows: (1) where x t−i+1 , i = 1, ..., m represents the element of the input vector X t ; w ji is the weight determining the relationship between the nodes; θ 0 is the bias of the output node; f (•) is the transfer function.Following extensive experimentation, the optimal hyper-parameter was ascertained.The Rectified Linear Unit (ReLU) serves as the ideal transfer function for forecasting R (Eq. 5).Yet, in predicting P r , tanh (Eq.6) ought to serve as the optimal transfer function.By utilizing neural networks, the value of loss progressively diminishes and reaches stability following 110 interactions.Consequently, the maximum number of iterations that yield reliable predictive outcomes for models ought to be 110.

Support vector machines
The Support Vector Machines (SVM) employs the kernel functions to convert the data into a hyperspace, enabling the representation of intricate patterns [36][37][38][39] .With the emerging hyperspace, SVMs aim to develop a hyperplane suitable for categorizing and constructing the broadest data margin, or one that accommodates data with minimal complexity and reduced empirical risk associated with the modelling function 27 .SVMs have been applied recently for many purposes in the field of hydrogeology [40][41][42] .In this study, the training data can be presented as {(x i , y i ), i = 1, 2,3, …, n}, where x is the input variable, and y is the output variable.A loss function offered by the SVM can be delineated in the following manner [43][44][45] : The issue with SVM can be characterized as the following optimization problem: where φ(x) is a kernel function designed for projecting the data into a hyperspace; 1 2 ω 2 stands for generalization; C n i=1 ξ i + ξ * i represents empirical risk; ξ i and ξ * i are slack variables for measuring "below" and "above" the ε tube (Fig. 5).Slack variables hold positive values while C remains a positive constant.
(4) where ŷi is the estimated value; y i is the actual value; n is the number of actual values.

Application on test site
The test site for this study is located in Xi'an City, Shaanxi Province (Fig. 6).The area has a warm temperate continental monsoon climate, with an annual average temperature of 13.6 °C and an average annual precipitation of 732.9 mm, mainly from July to September.According to the results of hydrogeological surveys and pumping tests, the sediments at this site are mainly coarse sand, interspersed with medium and fine sand mixed with pebbles.The groundwater depth is 13.03 m.The vertical hydraulic conductivity (K V ) is 8.33 m/d.The aquifer anisotropy of conductivity (K H /K V ) is 3. Its thickness (M) stands at 14.57 m.The hydraulic gradient (I) is 0.00357.The Quaternary porous aquifer is the target aquifer for its substantial water supply capabilities.According to the information provided, the aforementioned models are suitable for enhancing the GCW parameters at the test site.
Utilizing the particle swarm optimization (PSO) algorithm, the optimal value was identified post-training models, considering its straightforwardness and superior efficacy 41, 42.It is inspired by social behavior in nature such as bird flocking.Initially, PSO consists of a multitude of particles moving through the universe, organized to identify the best solution.During each cycle, pbest and gbest modify every particle.The pbest is represents the optimal solution a particle has achieved to date, whereas gbest stands as the universally optimal solution for any particle.Once pbest and gbest have been identified, the velocity and positions of the particles can be modified in the following manner: where V [•] means the particle velocity; present[•] represents the current particle; rand(•) is a random number between 0 and 1; c 1 and c 2 are defined as the learning factors, all operations of PSO algorithm are written in Python.
The aim of optimizing GCW is to identify the best solutions for models that exhibit the most rational R and Pr values.Enhancing R may expand the range of impact, whereas boosting P r could improve the effectiveness www.nature.com/scientificreports/ of corrective measures.Aimed at optimizing the values of R and P r , the particle swarm algorithm was employed to identify the most suitable solution.For each defining parameter, the fitness function was derived from the design optimization model.

Institutional review board statement
This study did not require ethical approval; all data are available in the public domain.

Model calibration
A groundwater numerical simulation model influenced by GCW in the test site was established based on the hydrogeological survey and pumping test.The GCW was set in the center of the model, along with four observation wells been placed alongside GCW with 5 m interval in the downstream direction.The wells were used to monitor the alteration in groundwater levels around the GCW.The observation data of water level variation and numerical simulation model's calculated data were both fitted to the water level in this study.forward method was used to adjust the numerical model's parameters until the simulated and measured values were in close agreement.the adjusted parameters of the site are shown in Table 2.
The comparative outcomes of the variation of observed and the simulated water level are shown in Fig. 7.It is evident that the degree of fitting is substantial.Consequently, the proven reliability of the existing numerical simulation model is evident, suggesting that both the model and its parameters can precisely mirror the present situation, paving the way for future research.

Database description
The database developed through numerical models in this study cover different media, aquifer thickness, and the hydrogeological parameters compositions apt for GCW 21.In these models, Q ranged from 12 to 299 m 3 /d, M     ranged from 1.5 to 35 m, K H /K V ranged from 3 to 10, K V ranged from 0.5 to 34 m/d (Table 3).Figure 8 illustrates the spread of key parameters within a three-dimensional framework.The data points, symbolizing a segment of the database, are uniformly and extensively dispersed, signifying the dependability and representativeness of the database developed through numerical modeling.

Results of different models
The R 2 and RMSE values for the SVM, ANN, and MLR models are shown in Table 4. Additionally, we entered the parameters of the testing site into the trained model to get the anticipated values of R 2 and P r .Compared them with the observed data, the results are shown in Fig. 9.By comparing the prediction accuracy evaluation metrics R 2 and RMSE, along with the scatterplots for these three models, reveals that the SVM model showed the best performance for the prediction of R 2 .Despite this, the ANN model exhibited enhanced precision in predicting Pr.As a result, the SVM and ANN models appear suitable for sequentially forecasting R and Pr throughout the design enhancement of GCW.An analysis of the scatter points from the three distinct methods reveals a closer grouping of the SVM model's points around the y = x target line.For predicting R, the SVM model achieves the highest R 2 value and the RMSE value nearest to 0. In predicting P r , the distribution of scatter points in the ANN model is denser along the target line.The ANN model has exhibits the highest R 2 value and RMSE value nearest to 0. Therefore, the SVM model shows higher performance in the prediction of R, while the ANN model excels in predicting Pr.Compared with the previous two methods, the MLR model performs lower precision in forecasting the two indicators.
1.The MLR algorithm demonstrates greater efficacy in training scenarios with restricted datasets.Conversely, the SVM and ANN algorithms are superior in training with extensive datasets, more accurately reflecting the interaction between input and output variables 36 .Such results confirm that the optimized design of GCW is not a simple linear relationship between the parameters, but is instead a complex regression problem.2. The simplicity of the MLR algorithm is notable, accompanied by a certain level of underfitting in its outcomes.
Conversely, the SVM and ANN models provide an appropriate degree of complexity.In the context of these models, neither underfitting nor overfitting happens during the data training phase.Furthermore, their capacity for making generalize are comparatively superior 37 .www.nature.com/scientificreports/ 3.After comparing the three models, we found that the SVM model predicts R and the ANN model predicts P r with the best results; in future studies, the two models should be integrated to predict the effects of different optimization indicators.

Optimization of the GCW structure at the test site
To determine the optimal solution rapidly and accurately, a total of 1000 particles were set up.The learning factors c 1 , c 2 were set to be 2.0.According to the site's actual circumstances, specific ranges of each parameter were set.After 50 iterations, the solutions of each design optimization model were verified.The iterative processes of PSO are shown in Fig. 10.For R and P r , there was a fluctuation in the objective value before 10 interactions, followed by a tendency to stabilize afterwards.After 50 iterations, the target value remained almost constant.Therefore, the solution can be used as the optimization schemes after 50 iterations, evidently showing effective PSO convergence and the suitability of the chosen parameters.
The key parameters for optimization scheme are shown in Table 5.For the optimization of R, the pumping rate should be high and a relatively long distance should be maintained between the two screens in order to push particles to a far position.For the optimization of P r , it's crucial to keep the pumping rate elevated and keep a minimal gap between the screens to ensure that the particles circulate in a small space with high recovery.
Before the design optimization, several numerical models were established to set R and P r with different design parameters.According to the results of the numerical simulation, a scheme with best effect was selected as an initial scheme for each indicator (Q = 200 m 3 /d, a = 6 m, L = 3 m, R = 17.85 m, P r = 0.86).Comparing the initial and optimized scheme, R increased from 17.85 to 25.915, accounting for an improvement of 45%, P r increased from 0.86 to 0.941, accounting for an improvement of 9.4%.Both R and P r increased after optimization, reflecting the effectiveness of the design optimization based on machine learning and numerical simulation.
This study proposes two optimization schemes for R and P r respectively.The schemes provide designers with diverse options.In cases where the test location exhibits a broad spectrum of contamination, the optimization scheme for R is suggested to be selected for enhancing the remediation range of the GCW.When the spread of contaminants is limited and the remediation duration is urgent, the optimal scheme for P r should be chosen to enhance the remediation efficiency.

Conclusions
This paper presents a novel design approach for optimization of GCW by combining the numerical simulation and machine learning.Numerical simulation is proved to be a superior and more economical method for data collection compared to typically lengthy and expensive lab and field tests.This method enables the rapid collection of comprehensive datasets for machine learning applications.The dataset is developed based on the prevailing conventional hydrogeological circumstances.In order to expand the implementation of the proposed schemes to additional test sites in the future, the simulation of data can be conducted by considering specific hydrogeological conditions of the respective remediation sites.By exerting these efforts, machine learning models can be made more dependable and precise.This finding of this research is as follows:   1.With the consideration of the unpredictability of hydrogeological parameters, configuration of the well, and operating parameters, the MLR, ANN, and SVM-based predictive model in the machine learning algorithm exhibits excellent compatibility with the numerical simulation model of GCW.The input-output relationship of the groundwater simulation model can be accurately represented.By designing GCW optimally, we can develop machine learning models to lessen computational demands.2. The operation efficacy of GCW plays a crucial role in groundwater remediation.In this study, two optimal strategies were implemented for GCW optimization at a test site in Xi'an.The schemes are derived from numerical simulation and machine learning using data of the test site.In order to apply schemes to other sites, only the hydrogeological parameters (M, K V , K H /K V , I) and the range of design parameters (Q, a, L) of the site need to be determined.By employing the trained model, we can perform calculations to improve the design of the GCW structure for specific locations.This approach provides a practical and highly efficient method for optimizing the design of GCW.
As a remediation technology of contaminated groundwater, the operational effectiveness of GCW is affected by the water quality and sediment concentration.However, the operation of GCW continues to face significant challenges due to hydrodynamic factors.During pumping and injection process, there is an increase in the hydraulic gradient of the groundwater flow field, propelling the movement of pollutant.Furthermore, it is essential to cleanse the circulation well while constructing it and use filter media around the edges of the pumping and injection screens to ensure the well's sediment levels remain low.To sum up, the study concentrated on hydraulic circulation's impact on GCW functioning, disregarding the influence of water quality and sediment density.
In spite of the simplified conceptual model composed of homogeneous and geometrically regular aquifer, this study is of great importance due to its considerably new scientific and practical application.In the future, more intricate situations may be considered, such as the direction of water flow.As the complexity of aquifer conditions increases, it may significantly increase the nonlinearity of the input-output relationship of simulation model.In order to improve applicability of designed GCW in complex site, deep-learning method and multiple-objective optimization model will be reached in future.

Figure 3 .
Figure 3.The schematic diagram of the indicators.

Figure 4 .
Figure 4. Structure of ANN for optimization of GCW.

Figure 6 .
Figure 6.Location of the GCW and the lithology of the test site.

Figure 7 .Table 3 .
Figure 7.The variation of observed and the simulated water level.

Figure 8 .
Figure 8. Distribution of main parameters in the database.

Figure 10 .
Figure 10.Changes in objective values with iterations.

Table 4 .
Coefficient of determination (R 2 ) and RMSE for the indicators of each model.

Table 5 .
Optimal solution of each characterization parameter.